Andreas Beger
Isaac Chung
🐍 We are also the PyData Tallinn co-organizers.
https://github.com/andybega/dataframes-workshop
| x | y | group |
|---|---|---|
| 1 | 2 | a |
| 4 | 7 | b |
| 3 | 8 | a |
| 9 | 2 | b |
Imagine working with tabular data if we didn’t have dataframes and associated methods.
import numpy as np
import pandas as pd
df = pd.DataFrame({
"quarter": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
"x": np.random.randn(12),
"date": pd.date_range("2024-01-01", periods=12, freq="MS")
})
df.head()

| | quarter | x | date |
|---|---|---|---|
| 0 | 1 | -1.125680 | 2024-01-01 |
| 1 | 1 | -0.112682 | 2024-02-01 |
| 2 | 1 | 0.259902 | 2024-03-01 |
| 3 | 2 | 0.426993 | 2024-04-01 |
| 4 | 2 | 0.669239 | 2024-05-01 |
Series
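Each column of a DataFrame is a Series: a one-dimensional labeled array with its own methods. A minimal sketch on toy data (not the accident dataset):

```python
import pandas as pd

# selecting one column of a DataFrame gives back a Series
df = pd.DataFrame({"x": [1, 4, 3, 9], "y": [2, 7, 8, 2], "group": ["a", "b", "a", "b"]})
s = df["y"]

print(type(s).__name__)  # Series
print(s.mean())          # 4.75
```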
Estonian vehicle accident data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14259 entries, 0 to 14258
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 14259 non-null object
1 persons_involved 14259 non-null int64
2 killed 14259 non-null int64
3 injured 14259 non-null int64
4 county 14259 non-null object
5 pedestrian_involved 14259 non-null int64
6 accident_type 14259 non-null object
7 light_conditions 14259 non-null object
dtypes: int64(4), object(4)
memory usage: 891.3+ KB
There are different ways; one is indexing with []:
Multiple columns
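The rule of thumb: a single column name returns a Series, a list of names returns a DataFrame. A sketch with a toy frame (assumed column names):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 4, 3, 9], "y": [2, 7, 8, 2], "group": ["a", "b", "a", "b"]})

one = df["x"]              # a single name -> Series
many = df[["x", "group"]]  # a list of names -> DataFrame with just those columns
```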
Right now date is stored as a string:
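`pd.to_datetime()` converts it to a proper datetime dtype, which unlocks the `.dt` accessors. A sketch using two sample timestamps in the same format as the data:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2014-10-24 08:45:00", "2014-11-17 17:32:00"]})
print(df["date"].dtype)             # object (plain strings)

# parse the strings into real datetimes
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)             # datetime64[ns]
print(df["date"].dt.year.tolist())  # [2014, 2014]
```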
See notebook.
See notebook.
2017, Wes McKinney (creator of pandas):
10 Things I Hate About Pandas
| date | persons_involved | killed | injured | county |
|---|---|---|---|---|
| str | i64 | i64 | i64 | str |
| "2014-10-24 08:45:00" | 2 | 0 | 1 | "Harju maakond" |
| "2014-10-24 13:45:00" | 2 | 0 | 1 | "Harju maakond" |
| "2014-08-11 00:00:00" | 2 | 0 | 1 | "Harju maakond" |
| "2014-11-17 17:32:00" | 2 | 0 | 2 | "Harju maakond" |
| "2015-04-28 07:55:00" | 2 | 0 | 1 | "Harju maakond" |
But we can always convert back and forth:
Expressions!
Expressions are abstract, composable data transformations that are executed with a context that provides data.
How many people were harmed in the biggest accident in our data?
- select()
- filter()
- with_columns(): mutating dataframes
- group_by() and aggregations
- with_columns() + expressions
accidents = accidents.with_columns(
pl.col("killed").add(pl.col("injured")).alias("killed_or_injured"),
pl.col("killed").add(pl.col("injured")).truediv(pl.col("persons_involved")).alias("harmed_rate")
)
accidents.select(["date", "persons_involved", "killed_or_injured", "harmed_rate"]).head(5)

| date | persons_involved | killed_or_injured | harmed_rate |
|---|---|---|---|
| str | i64 | i64 | f64 |
| "2014-10-24 08:45:00" | 2 | 1 | 0.5 |
| "2014-10-24 13:45:00" | 2 | 1 | 0.5 |
| "2014-08-11 00:00:00" | 2 | 1 | 0.5 |
| "2014-11-17 17:32:00" | 2 | 2 | 1.0 |
| "2015-04-28 07:55:00" | 2 | 1 | 0.5 |
group_by() + agg() or with_columns()
by_county = (accidents
.group_by("county")
.agg(pl.len().alias("accidents"),
pl.col("killed_or_injured").sum())
.sort("accidents", descending=True)
)
by_county.head()

| county | accidents | killed_or_injured |
|---|---|---|
| str | u32 | i64 |
| "Harju maakond" | 7000 | 8423 |
| "Tartu maakond" | 1591 | 1968 |
| "Pärnu maakond" | 1008 | 1293 |
| "Ida-Viru maakond" | 991 | 1348 |
| "Lääne-Viru maakond" | 645 | 883 |
(Optional)
county_pop = (pl.read_csv("data/county-pop.csv", skip_rows=2)
.rename({"County": "county", "Age groups total": "population"})
.select(["county", "population"])
# this has "county" in the county names, not "maakond"
.with_columns(pl.col("county").str.replace("county", "maakond"))
)
by_county_w_pop = by_county.join(county_pop, on="county", how="left")
by_county_w_pop.head(3)

| county | accidents | killed_or_injured | population |
|---|---|---|---|
| str | u32 | i64 | i64 |
| "Harju maakond" | 7000 | 8423 | 598059 |
| "Tartu maakond" | 1591 | 1968 | 152977 |
| "Pärnu maakond" | 1008 | 1293 | 85938 |
Now we can use a simple select() plus expressions to do the math:
by_county_w_pop.select(
pl.col("county"),
pl.col("killed_or_injured"),
pl.col("killed_or_injured").truediv(pl.col("population")).mul(1000).alias("rate/1000")
).head(3)

| county | killed_or_injured | rate/1000 |
|---|---|---|
| str | i64 | f64 |
| "Harju maakond" | 8423 | 14.083895 |
| "Tartu maakond" | 1968 | 12.864679 |
| "Pärnu maakond" | 1293 | 15.045731 |
We will use new data for this.
reflectors = (pl.read_csv("data/reflectors.csv", has_header=True, separator=";", skip_rows=2)
.filter(pl.col("Sex").ne("Men and women"))
.drop(["Type of data", "Year", "All age groups (16-64)"])
.sort("Reflector use", "Sex")
)
reflectors.head()

| Reflector use | Sex | 16-24 | 25-34 | 35-44 | 45-54 | 55-64 |
|---|---|---|---|---|---|---|
| str | str | f64 | f64 | f64 | f64 | f64 |
| "Nearly always" | "Men" | 34.3 | 40.5 | 52.2 | 58.6 | 55.9 |
| "Nearly always" | "Women" | 58.4 | 64.9 | 71.4 | 78.4 | 74.9 |
| "Never" | "Men" | 14.3 | 12.4 | 7.2 | 3.9 | 2.7 |
| "Never" | "Women" | 8.8 | 5.0 | 4.6 | 2.0 | 2.5 |
| "Never walk on dark streets, ro… | "Men" | 4.8 | 10.8 | 9.7 | 11.3 | 12.8 |
reflectors = (reflectors
.unpivot(index=["Reflector use", "Sex"],
variable_name="age_group",
value_name="percentage")
.sort("Reflector use", "Sex", "age_group")
)
reflectors.head()

| Reflector use | Sex | age_group | percentage |
|---|---|---|---|
| str | str | str | f64 |
| "Nearly always" | "Men" | "16-24" | 34.3 |
| "Nearly always" | "Men" | "25-34" | 40.5 |
| "Nearly always" | "Men" | "35-44" | 52.2 |
| "Nearly always" | "Men" | "45-54" | 58.6 |
| "Nearly always" | "Men" | "55-64" | 55.9 |
One category is “Never walk on dark streets, roads”…🧐
(reflectors
.with_columns(pl.col("Reflector use").str.replace("Never walk on dark streets, roads", "Never"))
.group_by(["Reflector use", "Sex", "age_group"])
.agg(pl.col("percentage").sum())
.filter(pl.col("Reflector use").eq("Never"))
.sort(["age_group", "Sex"])
.plot.line(x = "age_group", y = "percentage", color = "Sex")
.properties(width=700, height=300)
)

🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖
pandas
polars
Scan this and let us know how we did 🤗